CMSC320 FINAL PROJECT

Moses Kamoga

PREDICTING THE PRICE OF A HOUSE IN WASHINGTON STATE AND DOING SOME ANALYSIS ON THE DATA

INTRODUCTION

The overall objective of this project is to analyze house data, from the square footage to the location of a house, to better predict the price of a house in the state of Washington. This will not only give house buyers an estimate of what their budget should entail if they are looking to buy a house in a particular area, but also give them useful information on where they can get a house within their budget in a great location.

Throughout this tutorial we will fully analyze all the variables in this dataset and determine whether they are significant in explaining the change in the price of a house in Washington state. We will also offer more insights into the data by looking at the most expensive and least expensive cities in Washington state, so that we can properly guide both buyers and sellers in their quest to maneuver the real estate market in Washington.

Outline

1.0 DATA COLLECTION

1.1) Source of the dataset
1.2) Variables in the dataset
1.3) Background for the interest in this topic

2.0 SETTING THIS UP

2.1) Needed python libraries
2.2) Importing the dataset
2.3) View data
2.4) Clean data

3.0 DATA VISUALIZATION

3.1) Checking for correlation between the response variable (price) and independent variables
3.2) Generating the scatterplots
3.3) Analysing scatterplots

4.0 PERFORMING REGRESSION ON THE DATA

4.1) Analysing the regression model
4.2) Newly fitted model
4.3) Doing the F-Partial Test
4.4) Regression on the Final Model

5.0 FINAL REGRESSION EQUATION FOR THE MODEL

6.0 PUTTING THE FINAL MODEL TO THE TEST

6.1) Confidence interval

7.0 EXPLORING THE MOST EXPENSIVE CITY TO PURCHASE A HOUSE IN WASHINGTON STATE

7.1) Generating the desired table to be analysed and visualized
7.2) Histogram to help us visualize the average price of houses in different cities in WA
7.3) Analysing the histogram

8.0 MOST EXPENSIVE STREETS IN THE MOST EXPENSIVE CITY IN WA

8.1) Organising the data
8.2) Histogram to help us visualize the average prices of houses by street in Seattle

9.0 CONCLUSION

1. DATA COLLECTION

1.1) Source of the dataset

The dataset used was obtained from Kaggle: https://www.kaggle.com/shree1992/housedata#data.csv

1.2) Variables in the dataset

This dataset has 18 columns:

Date column.

The date on which the entry was recorded.

Price.

The price of the house.

Bedrooms

The number of bedrooms that a particular house has.

Bathrooms

The number of bathrooms that a house has.

Sqft_living

The number of living square feet of a given house.

Sqft_lot

The size of the lot or floor space of the house.

Floors

The number of floors that a house has.

Waterfront

Indicates whether the house is a waterfront property or not. It is a categorical variable, with '0' indicating that the house is not a waterfront property and '1' indicating that it is.

View

The number of views that the house has. This is a continuous variable.

Condition

This indicates the condition that the house is in, on a scale of 1 to 5, with 1 indicating that the house is not in good condition and 5 indicating that the house is in great condition.

Sqft_above

The square feet of the house above ground.

Sqft_basement

The square feet of the basement, if the property has one. If the house has no basement, this variable is denoted by 0.

Yr_built

This is the year the house was built.

Yr_renovated

This is the year the house was renovated, denoted by 0 if the house has never been renovated.

street

This denotes the street the house is built on.

city

The city where the particular house is found.

statezip

This denotes the state and the zipcode of a particular house.

country

This denotes the country in which the house is built.
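The variable types listed above can also be confirmed programmatically. The sketch below uses a tiny one-row stand-in frame with the same column names as the Kaggle dataset (the values are illustrative only) to show how pandas reports each column's type:

```python
import pandas as pd

# One-row stand-in with the same column names as the Kaggle dataset;
# the values here are illustrative only.
df = pd.DataFrame({'price': [313000.0], 'bedrooms': [3.0], 'waterfront': [0],
                   'street': ['18810 Densmore Ave N'], 'city': ['Shoreline']})

# Numeric predictors show up as float64/int64, while text columns such as
# street and city show up as object.
print(df.dtypes)
```

This is a quick way to separate the numeric predictors from the text columns (street, city, statezip) that will need different treatment later.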

1.3) Background of why I am interested in developing a model to predict the price of houses in Washington

I have always been fascinated with models that predict future trends. This fascination has over time turned into an obsession, and I knew that, given the opportunity, knowledge and resources, I would develop a model to predict future trends. Being a college senior, the time is nearing when I am supposed to look at houses to buy, for my future family or for investment purposes. What better way to prepare myself for this journey through the muddy waters of real estate than to understand the underlying factors that influence the price of a house? For this tutorial my focus is Washington state.

2. Setting this up.

The response variable is price, and the independent variables are the remaining variables. These predictors will help us analyse how price fluctuates as they change.

2.1) Needed python libraries

    - Pandas: used for data display and partitioning
    - Matplotlib - pyplot: used for plotting Pandas data into graphs and charts
    - Seaborn: provides a high-level interface for graphics on top of Matplotlib
    - scikit-learn: very popular machine learning library
    - linear_model: used to calculate models for multiple linear regression
    - model_selection: used to split up the dataset into test and training data and evaluate predictions
    - statsmodels - api: used to calculate models and statistics with multiple linear regression
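As a small illustration of the model_selection utility listed above, here is a minimal, self-contained sketch of splitting a tiny made-up frame (the numbers are invented, not from the dataset) into training and test sets:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# A tiny made-up frame; in the tutorial the real dataset is used instead.
df = pd.DataFrame({'sqft_living': [1340, 3650, 1930, 2000],
                   'price': [313000.0, 2384000.0, 342000.0, 420000.0]})
X = df[['sqft_living']]
y = df['price']

# Hold out 25% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
```

The held-out test rows are what let us evaluate predictions on data the model has never seen.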


2.2) Importing the dataset into the tutorial for analysis

We import the dataset from Kaggle by manually downloading it onto the local computer and storing it as a .csv file in the folder where this notebook is located. Then we use pandas' read_csv to read the file into the notebook so that we can carry out the analysis, as shown below.

In [1]:
!pip install lxml
!pip install html5lib
!pip install pydotplus
!pip install folium
import folium
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as bs
import matplotlib.pyplot as plt
import requests
import lxml
import html5lib
import seaborn as sns
import statsmodels.formula.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from PIL import Image
from urllib.request import Request, urlopen
import scipy.stats as stats
import math
In [2]:
#Importing the house price prediction dataset in our python project for analysis
df = pd.read_csv('data.csv')

2.3) View data

In [3]:
#The dimensions (rows, columns) of our dataset.
print(df.shape)
(4600, 18)

I will be carrying out the analysis on 4600 rows of data. This is more than enough rows for us to be confident in our findings and in the conclusions we will reach at the end of this tutorial.

In [4]:
#Below is what the table of the dataset looks like
df.head()
Out[4]:
date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition sqft_above sqft_basement yr_built yr_renovated street city statezip country
0 2014-05-02 00:00:00 313000.0 3.0 1.50 1340 7912 1.5 0 0 3 1340 0 1955 2005 18810 Densmore Ave N Shoreline WA 98133 USA
1 2014-05-02 00:00:00 2384000.0 5.0 2.50 3650 9050 2.0 0 4 5 3370 280 1921 0 709 W Blaine St Seattle WA 98119 USA
2 2014-05-02 00:00:00 342000.0 3.0 2.00 1930 11947 1.0 0 0 4 1930 0 1966 0 26206-26214 143rd Ave SE Kent WA 98042 USA
3 2014-05-02 00:00:00 420000.0 3.0 2.25 2000 8030 1.0 0 0 4 1000 1000 1963 0 857 170th Pl NE Bellevue WA 98008 USA
4 2014-05-02 00:00:00 550000.0 4.0 2.50 1940 10500 1.0 0 0 4 1140 800 1976 1992 9105 170th Ave NE Redmond WA 98052 USA

2.4) Cleaning the data

Our goal is to predict the price of a house in Washington state. Just by eyeballing the data, we can immediately see which columns we do not need in our analysis. We definitely do not need the date column: there is no way it makes sense to use the date an entry was recorded to predict the price of a house, so the date column has to be eliminated from the dataset. Inspecting the dataset further, the country column is also unnecessary, since all the houses in the dataset are in the USA and, more importantly, the analysis we are doing is for houses sold in Washington state, which is in the USA.

In conclusion, both the first column and the last column, date and country respectively, have to be deleted. I do not need to carry out any tests; common sense prevails here and saves me a lot of time.

In [5]:
#deleting the first column from the dataset.
df = df.drop(['date'], axis = 1)

#deleting the last column from the dataset
df = df.drop(['country'], axis = 1)

#The new look dataset
df.head()
Out[5]:
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition sqft_above sqft_basement yr_built yr_renovated street city statezip
0 313000.0 3.0 1.50 1340 7912 1.5 0 0 3 1340 0 1955 2005 18810 Densmore Ave N Shoreline WA 98133
1 2384000.0 5.0 2.50 3650 9050 2.0 0 4 5 3370 280 1921 0 709 W Blaine St Seattle WA 98119
2 342000.0 3.0 2.00 1930 11947 1.0 0 0 4 1930 0 1966 0 26206-26214 143rd Ave SE Kent WA 98042
3 420000.0 3.0 2.25 2000 8030 1.0 0 0 4 1000 1000 1963 0 857 170th Pl NE Bellevue WA 98008
4 550000.0 4.0 2.50 1940 10500 1.0 0 0 4 1140 800 1976 1992 9105 170th Ave NE Redmond WA 98052

From the 18 columns we initially started with, we are now down to 16. I will do the analysis on these 16 columns, and from these we will determine which variables are significant in explaining an increase or decrease in the price of a house.
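As a sanity check, the two drops can be verified in code. This sketch uses a stand-in frame with only the original 18 column names and no rows, and also shows that drop accepts a list, so both columns can go in a single call:

```python
import pandas as pd

# Stand-in frame with the original 18 column names and no rows.
cols = ['date', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
        'floors', 'waterfront', 'view', 'condition', 'sqft_above',
        'sqft_basement', 'yr_built', 'yr_renovated', 'street', 'city',
        'statezip', 'country']
df = pd.DataFrame(columns=cols)

# drop() accepts a list of labels, so both columns can be removed at once.
df = df.drop(['date', 'country'], axis=1)
assert df.shape[1] == 16
```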

3. DATA VISUALIZATION

3.1) Checking for correlation between our response variable (price) and predictors

Determining the correlation between the response variable and the predictors gives us insight into which variables strongly influence price. Using the scatter plots, we can tell whether a variable is highly correlated with price; we determine this by eyeballing the graphs and judging the level of correlation between price and each variable.
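The eyeballing can be complemented with a numeric check. The sketch below, which stands in a small inline sample for the cleaned DataFrame df from section 2.4, computes the Pearson correlation of each numeric column with price:

```python
import pandas as pd

# Small inline sample standing in for the cleaned dataset.
df = pd.DataFrame({'price':       [313000.0, 2384000.0, 342000.0, 420000.0, 550000.0],
                   'sqft_living': [1340, 3650, 1930, 2000, 1940],
                   'bedrooms':    [3.0, 5.0, 3.0, 3.0, 4.0]})

# Pearson correlation of every numeric column with price, strongest first.
corr_with_price = df.corr()['price'].sort_values(ascending=False)
print(corr_with_price)
```

Run on the full dataset, this gives a ranked list of candidate predictors before any regression is fit.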

3.2) Generating the scatter plots

I use matplotlib to generate the grid of subplots that I want to create: in this instance, 5 rows and 3 columns of subplots. Next, I create an array and store in it all the column names of the dataset except price. This array comes in handy when I generate the pivot tables that I plot on the subplots. Each pivot table has price, the response variable, on the y-axis, and on the x-axis the variable obtained by iterating over the array. Executing data.pivot_table().plot() then populates the individual graphs as line plots unless specified otherwise in the attributes.

In [6]:
#making the scatter plots
fig, axis = plt.subplots(5, 3, figsize=(20, 20))

#every column in the dataset except the response variable, price
columns = []
for column in df.columns:
    if column == 'price':
        continue
    columns.append(column)

#plot price against each remaining variable on its own subplot
for i, ax in enumerate(axis.flat):
    df.pivot_table('price', index=columns[i]).plot(ax=ax)

3.3) Analysing the scatterplots

We learn a lot about our data by just eyeballing the graphs. Let us dissect the graphs one by one to get a better understanding of how each variable relates to our response variable, 'price'.

1. price vs bedroom graph

From this graph we can see that the price is high at 0, decreases towards 1, increases from 1, and then increases again from 9 bedrooms and above. This does not make sense at all: we expect the price of a house to increase as the number of bedrooms increases. This abnormality can be attributed to other factors, such as the house being a waterfront property or being located in a very expensive zipcode. Later in this tutorial we will be able to test this.

2. price vs bathrooms

This graph behaves in almost the same way as the graph of price vs bedrooms. Having many bathrooms in a house does not necessarily mean that its price will go up. This can be attributed to the same factors we discussed for 'price vs bedrooms'. We will analyse this in more detail, and our evidence may support the conclusions here.

3. price vs sqft_living, price vs sqft_lot graphs

These also follow the same trend as the previous graphs. Later on, after doing more analysis, we will be able to draw final conclusions.

4. price vs floors

This graph follows the others: a higher number of floors does not necessarily mean that the price of the house will be high. Other factors come into play, such as where the house is located.

5. price vs waterfront

A waterfront house commands a higher price. We can see that this graph is highly correlated: the line keeps rising as waterfront increases. We will test this later on, and the regression analysis we do on this data should produce a large positive coefficient to account for this steep rise of the line.

6. price vs view

This graph is not as steep as 'price vs waterfront', but view is also correlated with price. As the number of views increases, the price of the house tends to increase, though there are a few instances where that is not the case.

7. price vs condition

We all believe that the better the condition of the house, the higher the price is going to be. The graph indicates that if the condition of the house is good, then it is going to cost a little more. There are a few instances that do not fit the norm, but the regression later will enable us to explain these discrepancies or abnormalities in our graph.

8. price vs sqft_above, price vs sqft_basement.

It is not really clear from these graphs whether these variables are correlated with price. We observe random spikes, but they do not tell us whether there is a correlation. More analysis will give us a clearer picture of what is going on here.

9. price vs yr_built, price vs yr_renovated

Both of these graphs suggest little to no correlation with price. This means the year a house was built or renovated may not have an effect on its price. We cannot really ascertain this without doing extra tests, but from eyeballing the graphs it is a safe working assumption for now.

10. price vs street, price vs city, price vs statezip

Street, city and statezip are not numerical variables, so just eyeballing the graphs does not tell us anything about their relationship with the price of the house. We will need a lot more analysis and many more visuals to clearly say how price is influenced by the street the house is built on, the city in which it is built, and the statezip where it is located.
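One way to start quantifying these categorical variables, sketched here with a small inline sample in place of the real df, is to group by city and compare the average price, which is the same idea section 7 pursues:

```python
import pandas as pd

# Small inline sample standing in for the cleaned dataset.
df = pd.DataFrame({'city':  ['Seattle', 'Seattle', 'Kent', 'Bellevue'],
                   'price': [2384000.0, 700000.0, 342000.0, 420000.0]})

# Average price per city, most expensive first.
avg_by_city = df.groupby('city')['price'].mean().sort_values(ascending=False)
print(avg_by_city)
```

The same groupby pattern works for street or statezip.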

4.0 PERFORMING REGRESSION ON THE DATA

After eyeballing the graphs above and making assumptions and conclusions from them, it is time to put those assumptions and conclusions to the test. We will achieve this by running a linear regression on the dataset; with real proof, we can ascertain whether our assumptions and conclusions above were indeed true. This will also help us understand the variables about which we could not say much earlier in terms of how they affect price. Before we do a regression on our data, we will remove three variables, namely statezip, city and street, as we have reasoned from the graphs above that the price of a house is influenced by the city, street and statezip in which it is located. With this we can be sure that the location of the house affects its price either positively or negatively.

This is due to other factors that are not captured in the model, such as: whether the neighbourhood is safe; whether the neighbourhood has great schools; whether the neighbourhood is in a prime location; and the cost of living in the particular state, which also influences the price of a house. If the cost of living is high, house prices will be high too, and vice versa.

I will do a regression on the rest of the variables to determine how significantly they influence the price of a house.

In [7]:
#Generating a regression model for the data.

newTable = df.copy()


#dropping the 3 location variables in one call
newTable = newTable.drop(['street', 'city', 'statezip'], axis=1)

result = sm.ols(formula="price ~ bathrooms + bedrooms + condition + floors + sqft_above + sqft_basement + sqft_living + sqft_lot + view + waterfront + C(yr_built) + C(yr_renovated) ", data=newTable).fit()
print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.244
Model:                            OLS   Adj. R-squared:                  0.213
Method:                 Least Squares   F-statistic:                     7.914
Date:                Fri, 13 Dec 2019   Prob (F-statistic):          3.83e-161
Time:                        03:49:02   Log-Likelihood:                -66800.
No. Observations:                4600   AIC:                         1.340e+05
Df Residuals:                    4419   BIC:                         1.351e+05
Df Model:                         180                                         
Covariance Type:            nonrobust                                         
===========================================================================================
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                1389.6200   1.52e+05      0.009      0.993   -2.97e+05       3e+05
C(yr_built)[T.1901]     -7.823e+04    2.1e+05     -0.373      0.709   -4.89e+05    3.33e+05
C(yr_built)[T.1902]     -5292.3236   2.03e+05     -0.026      0.979   -4.02e+05    3.92e+05
C(yr_built)[T.1903]     -2.396e+05   1.97e+05     -1.215      0.225   -6.26e+05    1.47e+05
C(yr_built)[T.1904]     -1.027e+05   2.09e+05     -0.491      0.623   -5.13e+05    3.07e+05
C(yr_built)[T.1905]     -1.851e+04    1.7e+05     -0.109      0.913   -3.52e+05    3.15e+05
C(yr_built)[T.1906]     -1.085e+05   1.72e+05     -0.630      0.529   -4.46e+05    2.29e+05
C(yr_built)[T.1907]     -3.024e+04   1.94e+05     -0.156      0.876   -4.12e+05    3.51e+05
C(yr_built)[T.1908]     -1.267e+05   1.71e+05     -0.741      0.459   -4.62e+05    2.09e+05
C(yr_built)[T.1909]     -2.254e+04    1.7e+05     -0.133      0.894   -3.55e+05     3.1e+05
C(yr_built)[T.1910]     -1.237e+05    1.6e+05     -0.773      0.440   -4.37e+05     1.9e+05
C(yr_built)[T.1911]     -1.368e+05   2.15e+05     -0.637      0.524   -5.58e+05    2.84e+05
C(yr_built)[T.1912]     -1.704e+05   1.62e+05     -1.048      0.294   -4.89e+05    1.48e+05
C(yr_built)[T.1913]     -7.068e+04   2.12e+05     -0.333      0.739   -4.86e+05    3.45e+05
C(yr_built)[T.1914]      5.105e+04   2.69e+05      0.190      0.849   -4.75e+05    5.78e+05
C(yr_built)[T.1915]     -1.031e+04    2.4e+05     -0.043      0.966   -4.81e+05    4.61e+05
C(yr_built)[T.1916]     -1.613e+05   1.82e+05     -0.885      0.376   -5.19e+05    1.96e+05
C(yr_built)[T.1917]     -1.468e+05   1.95e+05     -0.752      0.452   -5.29e+05    2.36e+05
C(yr_built)[T.1918]     -4.424e+04   1.73e+05     -0.256      0.798   -3.83e+05    2.95e+05
C(yr_built)[T.1919]     -1.894e+05   1.79e+05     -1.059      0.290    -5.4e+05    1.61e+05
C(yr_built)[T.1920]     -1.915e+05   1.72e+05     -1.112      0.266   -5.29e+05    1.46e+05
C(yr_built)[T.1921]     -9.499e+04    1.7e+05     -0.560      0.576   -4.28e+05    2.38e+05
C(yr_built)[T.1922]     -1.585e+05   1.82e+05     -0.869      0.385   -5.16e+05    1.99e+05
C(yr_built)[T.1923]     -5.355e+04   1.76e+05     -0.304      0.761   -3.99e+05    2.92e+05
C(yr_built)[T.1924]     -3.686e+04   1.66e+05     -0.222      0.824   -3.62e+05    2.88e+05
C(yr_built)[T.1925]      -1.63e+05   1.69e+05     -0.966      0.334   -4.94e+05    1.68e+05
C(yr_built)[T.1926]     -1.917e+05   1.49e+05     -1.285      0.199   -4.84e+05    1.01e+05
C(yr_built)[T.1927]     -6.304e+04   1.62e+05     -0.390      0.697    -3.8e+05    2.54e+05
C(yr_built)[T.1928]     -1.522e+04   1.73e+05     -0.088      0.930   -3.55e+05    3.25e+05
C(yr_built)[T.1929]     -1.262e+05   1.78e+05     -0.711      0.477   -4.74e+05    2.22e+05
C(yr_built)[T.1930]     -7.343e+04   1.73e+05     -0.425      0.671   -4.12e+05    2.65e+05
C(yr_built)[T.1931]     -1.065e+05   1.78e+05     -0.597      0.551   -4.56e+05    2.43e+05
C(yr_built)[T.1932]     -1.829e+05   2.48e+05     -0.738      0.460   -6.69e+05    3.03e+05
C(yr_built)[T.1933]     -5.519e+04   2.59e+05     -0.213      0.832   -5.64e+05    4.54e+05
C(yr_built)[T.1934]      3.905e+05   5.27e+05      0.741      0.459   -6.43e+05    1.42e+06
C(yr_built)[T.1935]      1.118e+05    2.7e+05      0.414      0.679   -4.18e+05    6.42e+05
C(yr_built)[T.1936]      6.636e+04   3.16e+05      0.210      0.833   -5.52e+05    6.85e+05
C(yr_built)[T.1937]     -1.104e+04   1.71e+05     -0.065      0.948   -3.46e+05    3.24e+05
C(yr_built)[T.1938]     -2.217e+05   1.96e+05     -1.130      0.259   -6.07e+05    1.63e+05
C(yr_built)[T.1939]     -6.795e+04   2.11e+05     -0.322      0.748   -4.82e+05    3.46e+05
C(yr_built)[T.1940]     -1.305e+05   1.63e+05     -0.801      0.423    -4.5e+05    1.89e+05
C(yr_built)[T.1941]      -2.71e+05   1.62e+05     -1.676      0.094   -5.88e+05    4.59e+04
C(yr_built)[T.1942]     -2.269e+05   1.56e+05     -1.453      0.146   -5.33e+05    7.93e+04
C(yr_built)[T.1943]     -2.834e+05   1.57e+05     -1.806      0.071   -5.91e+05    2.43e+04
C(yr_built)[T.1944]     -1.647e+05   1.63e+05     -1.012      0.312   -4.84e+05    1.54e+05
C(yr_built)[T.1945]     -1.929e+05   1.61e+05     -1.195      0.232   -5.09e+05    1.24e+05
C(yr_built)[T.1946]     -4.448e+04   1.68e+05     -0.265      0.791   -3.73e+05    2.84e+05
C(yr_built)[T.1947]     -1.821e+05   1.55e+05     -1.176      0.240   -4.86e+05    1.22e+05
C(yr_built)[T.1948]     -2.987e+05   1.47e+05     -2.037      0.042   -5.86e+05   -1.12e+04
C(yr_built)[T.1949]     -2.001e+05   1.64e+05     -1.217      0.224   -5.22e+05    1.22e+05
C(yr_built)[T.1950]     -2.215e+05   1.43e+05     -1.554      0.120   -5.01e+05     5.8e+04
C(yr_built)[T.1951]     -2.883e+05   1.56e+05     -1.849      0.065   -5.94e+05    1.74e+04
C(yr_built)[T.1952]     -1.617e+05   1.48e+05     -1.090      0.276   -4.52e+05    1.29e+05
C(yr_built)[T.1953]     -2.195e+05   1.55e+05     -1.417      0.156   -5.23e+05    8.41e+04
C(yr_built)[T.1954]     -2.634e+05   1.41e+05     -1.871      0.061   -5.39e+05    1.26e+04
C(yr_built)[T.1955]     -2.493e+05    1.4e+05     -1.786      0.074   -5.23e+05    2.43e+04
C(yr_built)[T.1956]     -2337.7705   1.48e+05     -0.016      0.987   -2.93e+05    2.89e+05
C(yr_built)[T.1957]     -3.091e+05   1.54e+05     -2.011      0.044   -6.11e+05   -7731.423
C(yr_built)[T.1958]     -2.402e+05   1.59e+05     -1.506      0.132   -5.53e+05    7.25e+04
C(yr_built)[T.1959]     -1.936e+05   1.44e+05     -1.347      0.178   -4.75e+05    8.82e+04
C(yr_built)[T.1960]     -3.609e+05   1.53e+05     -2.355      0.019   -6.61e+05   -6.05e+04
C(yr_built)[T.1961]     -3.141e+05   1.56e+05     -2.008      0.045   -6.21e+05   -7454.329
C(yr_built)[T.1962]     -2.822e+05   1.42e+05     -1.984      0.047   -5.61e+05   -3364.242
C(yr_built)[T.1963]     -2.273e+05   1.48e+05     -1.535      0.125   -5.18e+05    6.31e+04
C(yr_built)[T.1964]     -2.398e+05   1.59e+05     -1.506      0.132   -5.52e+05    7.23e+04
C(yr_built)[T.1965]     -3.059e+05   1.56e+05     -1.958      0.050   -6.12e+05     347.078
C(yr_built)[T.1966]     -2.337e+05    1.5e+05     -1.560      0.119   -5.27e+05       6e+04
C(yr_built)[T.1967]     -2.214e+05    1.4e+05     -1.579      0.114   -4.96e+05    5.35e+04
C(yr_built)[T.1968]     -2.889e+05   1.41e+05     -2.049      0.041   -5.65e+05   -1.24e+04
C(yr_built)[T.1969]     -1.633e+05   1.48e+05     -1.105      0.269   -4.53e+05    1.26e+05
C(yr_built)[T.1970]     -2.705e+05   1.64e+05     -1.653      0.098   -5.91e+05    5.04e+04
C(yr_built)[T.1971]     -1.966e+05   1.78e+05     -1.106      0.269   -5.45e+05    1.52e+05
C(yr_built)[T.1972]     -1.992e+05   1.57e+05     -1.269      0.204   -5.07e+05    1.08e+05
C(yr_built)[T.1973]     -2.058e+05    1.5e+05     -1.375      0.169   -4.99e+05    8.77e+04
C(yr_built)[T.1974]     -2.716e+05    1.6e+05     -1.694      0.090   -5.86e+05    4.26e+04
C(yr_built)[T.1975]     -2.098e+05   1.52e+05     -1.384      0.166   -5.07e+05    8.73e+04
C(yr_built)[T.1976]     -2.429e+05   1.54e+05     -1.578      0.115   -5.45e+05     5.9e+04
C(yr_built)[T.1977]     -2.977e+05   1.41e+05     -2.108      0.035   -5.74e+05   -2.09e+04
C(yr_built)[T.1978]      -2.78e+05   1.38e+05     -2.011      0.044   -5.49e+05   -7002.741
C(yr_built)[T.1979]     -3.327e+05   1.46e+05     -2.282      0.023   -6.19e+05   -4.69e+04
C(yr_built)[T.1980]     -2.873e+05    1.5e+05     -1.921      0.055    -5.8e+05    5922.390
C(yr_built)[T.1981]     -2.041e+05   1.56e+05     -1.308      0.191    -5.1e+05    1.02e+05
C(yr_built)[T.1982]     -1.865e+05   1.64e+05     -1.139      0.255   -5.08e+05    1.35e+05
C(yr_built)[T.1983]     -3.229e+05   1.47e+05     -2.201      0.028    -6.1e+05   -3.53e+04
C(yr_built)[T.1984]     -2.896e+05   1.46e+05     -1.991      0.047   -5.75e+05   -4368.730
C(yr_built)[T.1985]     -1.955e+05   1.47e+05     -1.332      0.183   -4.83e+05    9.22e+04
C(yr_built)[T.1986]     -2.559e+05   1.47e+05     -1.742      0.082   -5.44e+05    3.22e+04
C(yr_built)[T.1987]     -2.431e+05   1.43e+05     -1.696      0.090   -5.24e+05    3.79e+04
C(yr_built)[T.1988]     -3.055e+05   1.48e+05     -2.063      0.039   -5.96e+05   -1.52e+04
C(yr_built)[T.1989]     -2.469e+05   1.41e+05     -1.748      0.081   -5.24e+05       3e+04
C(yr_built)[T.1990]     -4.405e+05   1.47e+05     -2.989      0.003   -7.29e+05   -1.52e+05
C(yr_built)[T.1991]     -3.245e+05   1.46e+05     -2.228      0.026    -6.1e+05   -3.89e+04
C(yr_built)[T.1992]      4.525e+05   1.52e+05      2.974      0.003    1.54e+05    7.51e+05
C(yr_built)[T.1993]     -2.836e+05   1.47e+05     -1.926      0.054   -5.72e+05    5134.023
C(yr_built)[T.1994]     -2.738e+05   1.46e+05     -1.871      0.061   -5.61e+05    1.31e+04
C(yr_built)[T.1995]     -2.709e+05   1.55e+05     -1.748      0.081   -5.75e+05    3.29e+04
C(yr_built)[T.1996]     -2.426e+05    1.5e+05     -1.619      0.105   -5.36e+05    5.11e+04
C(yr_built)[T.1997]     -2.241e+05    1.5e+05     -1.493      0.135   -5.18e+05    7.01e+04
C(yr_built)[T.1998]     -4.127e+05   2.03e+05     -2.033      0.042   -8.11e+05   -1.47e+04
C(yr_built)[T.1999]     -2.435e+05   1.46e+05     -1.663      0.096   -5.31e+05    4.36e+04
C(yr_built)[T.2000]     -1.769e+05   1.48e+05     -1.191      0.234   -4.68e+05    1.14e+05
C(yr_built)[T.2001]     -2.429e+05   1.44e+05     -1.689      0.091   -5.25e+05     3.9e+04
C(yr_built)[T.2002]     -3.723e+05   1.48e+05     -2.518      0.012   -6.62e+05   -8.25e+04
C(yr_built)[T.2003]     -2.239e+05    1.4e+05     -1.601      0.109   -4.98e+05    5.02e+04
C(yr_built)[T.2004]       -3.8e+05   1.56e+05     -2.438      0.015   -6.86e+05   -7.44e+04
C(yr_built)[T.2005]     -2.396e+05   1.39e+05     -1.728      0.084   -5.11e+05    3.23e+04
C(yr_built)[T.2006]     -2.601e+05   1.38e+05     -1.883      0.060   -5.31e+05    1.07e+04
C(yr_built)[T.2007]     -2.056e+05    1.4e+05     -1.470      0.142    -4.8e+05    6.86e+04
C(yr_built)[T.2008]     -1.886e+05    1.4e+05     -1.344      0.179   -4.64e+05    8.65e+04
C(yr_built)[T.2009]      -2.77e+05   1.48e+05     -1.874      0.061   -5.67e+05    1.28e+04
C(yr_built)[T.2010]     -1.897e+05   1.61e+05     -1.180      0.238   -5.05e+05    1.25e+05
C(yr_built)[T.2011]     -2.734e+05   1.65e+05     -1.655      0.098   -5.97e+05    5.05e+04
C(yr_built)[T.2012]     -1.504e+05   7.81e+04     -1.926      0.054   -3.04e+05    2704.811
C(yr_built)[T.2013]     -1.314e+05   7.29e+04     -1.802      0.072   -2.74e+05    1.15e+04
C(yr_built)[T.2014]     -2.075e+05   1.42e+05     -1.466      0.143   -4.85e+05    7.01e+04
C(yr_renovated)[T.1912] -1.504e+05   7.81e+04     -1.926      0.054   -3.04e+05    2704.811
C(yr_renovated)[T.1913]  1.915e+04   5.07e+05      0.038      0.970   -9.76e+05    1.01e+06
C(yr_renovated)[T.1923] -1.314e+05   7.29e+04     -1.802      0.072   -2.74e+05    1.15e+04
C(yr_renovated)[T.1934] -3.194e+04   2.42e+05     -0.132      0.895   -5.07e+05    4.43e+05
C(yr_renovated)[T.1945]    5.5e+04   2.82e+05      0.195      0.845   -4.98e+05    6.08e+05
C(yr_renovated)[T.1948]  4.849e+04   5.35e+05      0.091      0.928      -1e+06     1.1e+06
C(yr_renovated)[T.1953] -2.196e+05   5.14e+05     -0.427      0.669   -1.23e+06    7.89e+05
C(yr_renovated)[T.1954] -7.098e+04   2.14e+05     -0.332      0.740    -4.9e+05    3.48e+05
C(yr_renovated)[T.1955] -5.562e+04   3.73e+05     -0.149      0.881   -7.87e+05    6.76e+05
C(yr_renovated)[T.1956] -1.828e+04    1.6e+05     -0.114      0.909   -3.31e+05    2.95e+05
C(yr_renovated)[T.1958]  4.785e+04   2.84e+05      0.169      0.866   -5.08e+05    6.04e+05
C(yr_renovated)[T.1960] -9.266e+04   5.16e+05     -0.180      0.857    -1.1e+06    9.18e+05
C(yr_renovated)[T.1963]  2.094e+04   1.67e+05      0.126      0.900   -3.06e+05    3.48e+05
C(yr_renovated)[T.1966]  1.541e+05   5.35e+05      0.288      0.773   -8.95e+05     1.2e+06
C(yr_renovated)[T.1968]  1.988e+04   1.96e+05      0.102      0.919   -3.64e+05    4.04e+05
C(yr_renovated)[T.1969] -4.099e+04    2.2e+05     -0.187      0.852   -4.72e+05     3.9e+05
C(yr_renovated)[T.1970] -5.498e+04   1.99e+05     -0.276      0.782   -4.45e+05    3.35e+05
C(yr_renovated)[T.1971] -9.749e+04   2.47e+05     -0.394      0.693   -5.82e+05    3.87e+05
C(yr_renovated)[T.1972] -3.198e+04   1.57e+05     -0.203      0.839   -3.41e+05    2.77e+05
C(yr_renovated)[T.1974] -2.256e+05   2.49e+05     -0.906      0.365   -7.14e+05    2.63e+05
C(yr_renovated)[T.1975] -4.178e+05   5.09e+05     -0.821      0.412   -1.42e+06     5.8e+05
C(yr_renovated)[T.1977] -1.062e+05   5.34e+05     -0.199      0.842   -1.15e+06    9.41e+05
C(yr_renovated)[T.1978] -5.558e+05   5.93e+05     -0.937      0.349   -1.72e+06    6.07e+05
C(yr_renovated)[T.1979]  1.577e+05   1.21e+05      1.299      0.194   -8.03e+04    3.96e+05
C(yr_renovated)[T.1980] -3.127e+05   3.64e+05     -0.858      0.391   -1.03e+06    4.02e+05
C(yr_renovated)[T.1981]  4.683e+04   5.34e+05      0.088      0.930      -1e+06    1.09e+06
C(yr_renovated)[T.1982]  -844.4928   1.34e+05     -0.006      0.995   -2.63e+05    2.62e+05
C(yr_renovated)[T.1983]  4730.7617      1e+05      0.047      0.962   -1.92e+05    2.01e+05
C(yr_renovated)[T.1984] -3.083e+04   1.97e+05     -0.157      0.875   -4.16e+05    3.55e+05
C(yr_renovated)[T.1985]  1.028e+05   1.13e+05      0.909      0.363   -1.19e+05    3.25e+05
C(yr_renovated)[T.1986]  1.033e+05   2.44e+05      0.423      0.672   -3.75e+05    5.82e+05
C(yr_renovated)[T.1987]  -2.31e+05   5.31e+05     -0.435      0.663   -1.27e+06    8.09e+05
C(yr_renovated)[T.1988]  3.398e+04   1.05e+05      0.324      0.746   -1.72e+05     2.4e+05
C(yr_renovated)[T.1989] -1976.7121   8.94e+04     -0.022      0.982   -1.77e+05    1.73e+05
C(yr_renovated)[T.1990]  9.759e+04   1.52e+05      0.640      0.522   -2.01e+05    3.97e+05
C(yr_renovated)[T.1991]  6.139e+04   5.37e+05      0.114      0.909   -9.91e+05    1.11e+06
C(yr_renovated)[T.1992] -3789.3731   1.18e+05     -0.032      0.974   -2.35e+05    2.27e+05
C(yr_renovated)[T.1993]  1.384e+05   1.06e+05      1.309      0.191   -6.88e+04    3.46e+05
C(yr_renovated)[T.1994]  2.269e+05   9.05e+04      2.506      0.012    4.94e+04    4.04e+05
C(yr_renovated)[T.1995]  2.802e+05   3.63e+05      0.771      0.441   -4.32e+05    9.93e+05
C(yr_renovated)[T.1996]  5.466e+04   1.44e+05      0.379      0.705   -2.28e+05    3.38e+05
C(yr_renovated)[T.1997]  5.695e+04    1.1e+05      0.517      0.605   -1.59e+05    2.73e+05
C(yr_renovated)[T.1998]  3.132e+04    9.8e+04      0.320      0.749   -1.61e+05    2.23e+05
C(yr_renovated)[T.1999]  5.976e+04   1.03e+05      0.579      0.562   -1.42e+05    2.62e+05
C(yr_renovated)[T.2000]  5.846e+04   5.92e+04      0.987      0.324   -5.76e+04    1.75e+05
C(yr_renovated)[T.2001]  1.605e+05   7.18e+04      2.235      0.025    1.97e+04    3.01e+05
C(yr_renovated)[T.2002]   1.66e+05   9.83e+04      1.689      0.091   -2.67e+04    3.59e+05
C(yr_renovated)[T.2003]  1.355e+05   8.45e+04      1.604      0.109   -3.01e+04    3.01e+05
C(yr_renovated)[T.2004]  7.406e+04   8.26e+04      0.897      0.370   -8.78e+04    2.36e+05
C(yr_renovated)[T.2005]  1.222e+05   8.09e+04      1.511      0.131   -3.63e+04    2.81e+05
C(yr_renovated)[T.2006]  1.041e+05   1.49e+05      0.701      0.484   -1.87e+05    3.95e+05
C(yr_renovated)[T.2007]  1.873e+05   1.96e+05      0.956      0.339   -1.97e+05    5.71e+05
C(yr_renovated)[T.2008]  8.475e+04   9.62e+04      0.881      0.379   -1.04e+05    2.73e+05
C(yr_renovated)[T.2009]   1.64e+05   7.84e+04      2.092      0.036    1.03e+04    3.18e+05
C(yr_renovated)[T.2010]  6.929e+04   1.06e+05      0.654      0.513   -1.39e+05    2.77e+05
C(yr_renovated)[T.2011] -1.623e+04   8.49e+04     -0.191      0.848   -1.83e+05     1.5e+05
C(yr_renovated)[T.2012]  9.563e+04   1.09e+05      0.876      0.381   -1.19e+05     3.1e+05
C(yr_renovated)[T.2013]  6205.7365   8.39e+04      0.074      0.941   -1.58e+05    1.71e+05
C(yr_renovated)[T.2014]  1.125e+05      8e+04      1.406      0.160   -4.44e+04    2.69e+05
bathrooms                5.056e+04   1.78e+04      2.845      0.004    1.57e+04    8.54e+04
bedrooms                -4.962e+04   1.11e+04     -4.478      0.000   -7.13e+04   -2.79e+04
condition                6.405e+04   1.72e+04      3.729      0.000    3.04e+04    9.77e+04
floors                   1.946e+04   2.17e+04      0.898      0.369    -2.3e+04     6.2e+04
sqft_above                 96.2471     10.161      9.472      0.000      76.327     116.167
sqft_basement              67.8372     14.237      4.765      0.000      39.926      95.749
sqft_living               164.0843     10.052     16.324      0.000     144.378     183.791
sqft_lot                   -0.6659      0.219     -3.041      0.002      -1.095      -0.237
view                     4.301e+04   1.13e+04      3.819      0.000    2.09e+04    6.51e+04
waterfront               3.705e+05   9.81e+04      3.776      0.000    1.78e+05    5.63e+05
==============================================================================
Omnibus:                    12686.634   Durbin-Watson:                   1.967
Prob(Omnibus):                  0.000   Jarque-Bera (JB):        551488839.106
Skew:                          34.374   Prob(JB):                         0.00
Kurtosis:                    1697.877   Cond. No.                     1.10e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 5.72e-20. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

4.1) Analysing the regression model

The coefficients tell us whether a variable increases or decreases the price of a house. With a negative coefficient, an increase in the variable decreases the predicted price; with a positive coefficient, an increase in the variable increases it.

From the model above, bathrooms, condition, floors, sqft_above, sqft_basement, sqft_living, view, waterfront, and yr_renovated have positive coefficients and are associated with an increase in the price of a house.

Bedrooms and yr_built, on the other hand, are associated with a decrease in the price of a house. For yr_built, common sense suggests that houses built long ago are somewhat more expensive than houses built recently. As for why bedrooms negatively affect the price, we could argue that city-center apartments with fewer bedrooms are more expensive than houses in the suburbs, because their location is very appealing to buyers.

Now we look at the p-values. These tell us how significant each variable is in explaining the change in the price of a house: if the p-value is approximately zero, the variable is significant in explaining the change (increase or decrease) in price. From the model above, all the variables are significant except yr_renovated.
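The p-value screening described above can also be done programmatically rather than by reading the summary table. Below is a minimal sketch on synthetic stand-in data (so it runs on its own; in this notebook you would filter the fitted `result` on the real `df` instead) that keeps only the terms of a fitted model whose `.pvalues` fall below 0.05:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm  # same alias used throughout this tutorial

# Synthetic stand-in data; in the notebook this would be the housing df
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "sqft_living": rng.uniform(500, 4000, 200),
    "bedrooms": rng.integers(1, 6, 200),
})
demo["price"] = (150 * demo["sqft_living"]
                 - 20000 * demo["bedrooms"]
                 + rng.normal(0, 50000, 200))

fit = sm.ols("price ~ sqft_living + bedrooms", data=demo).fit()

# .pvalues is a Series indexed by term name; keep only the significant terms
significant = fit.pvalues[fit.pvalues < 0.05]
print(significant.index.tolist())
```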

Moment of truth

Is this model significant in explaining the response in price?

To answer this question, we look at the R-squared. Its value is so small that this model is not significant in explaining the response in price.

One possible explanation is that the three variables left out of this model (city, statezip, and street) account for the low R-squared value.

Solution.

We fit a new model with all the original variables.

4.2) Newly fitted model with all the variables

We fit a new model that also includes city, statezip, and street. In the previous model, the p-values for yr_built and yr_renovated were too high, so we remove them from the newly fitted model; the p-value for floors is also too high, so we remove it as well. Furthermore, statezip and city are not really needed, since the street address of a house already provides that information: the city, state, and zip code can all be derived from it. So the city and statezip columns are removed to eliminate redundancy in our model.

In [8]:
#Copying the content of the dataset.
newModel = df.copy()

#Fitting the new model without floors, yr_built, yr_renovated, city, and statezip (reasons given above), but including street.
result_new_Model = sm.ols(formula="price ~ bathrooms + bedrooms + condition + sqft_above + sqft_basement + sqft_living + sqft_lot + view + waterfront + street ", data=newModel).fit()

#print(result_new_Model.summary())

Understanding this new model

Good news!!!

R-squared is 99%, which is very high and shows that the model is significant in explaining the change in price. This is what we were looking for, but we still have work to do, since some variables have high p-values. If a p-value is not approximately zero, that variable is not significant in explaining the change in price. We therefore remove these variables so that only the significant ones remain.

The variables in question are:

-waterfront which has a p-value of 0.117

-sqft_basement which has a p-value of 0.180

-condition which has a p-value of 0.969

-bedrooms which has a p-value of 0.932

Unlike R-squared, the Adj. R-squared penalizes predictors that are not significant in explaining the change in the response variable.

4.3) Doing the F-partial test

We do the F-partial test to show that our new model is better than the old model at explaining the change in the price of a house. If the p-value is approximately zero, this lets us know that our model is better.

Hypothesis

Null hypothesis: the original model with the variables waterfront, sqft_basement, condition, and bedrooms explains the change in price better than the smaller model.

Alternative hypothesis: the new model without the variables waterfront, sqft_basement, condition, and bedrooms explains the change in price better than the full model.

F-Partial test

The p-value is approximately zero, so we reject the null hypothesis.

With this test, we can fit the new (smaller) model.
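The notebook does not show the code for the F-partial test, so here is a minimal sketch of how it can be run with statsmodels' `compare_f_test`, again on synthetic stand-in data (in the notebook the full and reduced models would both be fit on `df`). Here `condition` is generated with no real effect on price, purely to illustrate the mechanics:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Synthetic stand-in data: price depends on sqft_living and bathrooms only
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "sqft_living": rng.uniform(500, 4000, 300),
    "bathrooms": rng.integers(1, 4, 300),
    "condition": rng.integers(1, 6, 300),  # no real effect by construction
})
demo["price"] = (160 * demo["sqft_living"]
                 + 30000 * demo["bathrooms"]
                 + rng.normal(0, 40000, 300))

full = sm.ols("price ~ sqft_living + bathrooms + condition", data=demo).fit()
reduced = sm.ols("price ~ sqft_living + bathrooms", data=demo).fit()

# compare_f_test tests H0 "the dropped coefficients are all zero" and
# returns (F statistic, p-value, difference in degrees of freedom)
f_stat, p_value, df_diff = full.compare_f_test(reduced)
print(f_stat, p_value, df_diff)
```

Note that under `compare_f_test`, the null hypothesis is that the dropped coefficients are all zero; one dropped term means a degrees-of-freedom difference of 1.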

4.4) Regression on the Final model

We fit a final model with the variables above removed.

In [9]:
#Fitting the final model
newModelTable = df.copy()

result_NewModel_Table = sm.ols(formula="price ~ bathrooms + sqft_above  + sqft_lot + view + C(waterfront) + C(street)", data=newModelTable).fit()

#print(result_NewModel_Table.summary())

In this model, R-squared is 99%, which means this model is significant in explaining the change in price. There is no dramatic change in the Adj. R-squared, but there is a slight increase of 0.2.

5.0 FINAL REGRESSION EQUATION FOR THE MODEL

Below is the equation for the model that can be used to predict the price of the house given the independent variables in the model.

Equation

price_estimate = -4.768e+05 + 3.5e+04(waterfront) + 1.13e+06(10 W Etruria St) + ... + 1.15e+05(Valley View Trail) - 1.512e+05(bathrooms) + 519.1622(sqft_above) + 53.2727(sqft_lot) + 3.146e+05(view)

where:-

The streets are categorical variables; there are roughly 4,600 of them in our regression table, so I cannot list them all here. In the full equation, every street is listed, and if a house belongs to a particular street, that street's indicator is set to 1 and all the other streets are set to 0.

Waterfront is also a categorical variable: if the house is located on a waterfront, it is set to 1, and if it is not, it is set to 0.
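Rather than transcribing coefficients by hand into an equation, they can be looked up in the fitted model's `.params` Series, which covers each street dummy as well. A tiny sketch with made-up street names (not from the dataset):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Made-up data with a categorical street variable, mirroring C(street)
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "sqft_above": rng.uniform(400, 3000, 120),
    "street": rng.choice(["A St", "B Ave", "C Rd"], 120),
})
demo["price"] = 500 * demo["sqft_above"] + rng.normal(0, 10000, 120)

fit = sm.ols("price ~ sqft_above + C(street)", data=demo).fit()

# .params maps every term, including each street dummy, to its coefficient
print(fit.params["Intercept"])
print(fit.params["sqft_above"])          # recovers roughly 500
print(fit.params["C(street)[T.B Ave]"])  # dummy coefficient for one street level
```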

6.0 PUTTING THE FINAL MODEL TO THE TEST

We now put the model to the test. Having removed the variables that do not significantly explain the change in price, we test the model by taking a recent real listing from Washington state and seeing what it predicts. Below is a real listing of a house in Seattle that was on the market recently and just sold for 615,000 dollars.

Link for the listing --> https://www.zillow.com/homedetails/2022-24th-Ave-E-Seattle-WA-98112/49098189_zpid/?

In [10]:
#Displayed is a real listing that sold on 2 December 2019
display(Image.open("Screen Shot 2019-12-09 at 1.14.19 PM.png"))

As pointed out earlier, the listing gives us the street, state, city, and zip code of where the house is located. We can plug the listing's values into the formula above.

The street is ---> 2022 24th Ave E, Seattle, WA 98112, and from the model its coefficient is ---> 6.979e+05

waterfront ---> the house is not located on a waterfront, so this variable takes a zero.

The number of bathrooms is ----> 2

Sqft_above ---> 430; this is calculated by subtracting the basement sqft (430 sqft) from the total sqft (860 sqft).

Sqft_lot ---> 3088

view ---> the number of views that the house has; in this case, it is 1.

We plug these values into our formula to estimate the price of the house.

In [11]:
price_Estimate = -4.768e+05 + (3.5e+04*0) + (6.979e+05 * 1) - (1.512e+05 * 2) + (53.2727   * 3088) + (3.146e+05  * 1) + (519.1622 * 430)
print(price_Estimate)
621045.8436

Our model estimates the price of the house to be 621,045.84. The house actually sold for 615,000, so we can be confident in our model predicting the price of a house with minimal error.
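A hand-plugged formula like the one above is easy to mistype; statsmodels can also compute the estimate directly with `.predict()`, which applies the model's formula to a new row. A self-contained sketch on synthetic data (the coefficients and features here are made up, not the final model's):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Synthetic stand-in data using three of the final model's predictors
rng = np.random.default_rng(3)
demo = pd.DataFrame({
    "bathrooms": rng.integers(1, 4, 150),
    "sqft_above": rng.uniform(400, 3000, 150),
    "sqft_lot": rng.uniform(1000, 10000, 150),
})
demo["price"] = (40000 * demo["bathrooms"]
                 + 300 * demo["sqft_above"]
                 + 5 * demo["sqft_lot"]
                 + rng.normal(0, 20000, 150))

fit = sm.ols("price ~ bathrooms + sqft_above + sqft_lot", data=demo).fit()

# Describe the listing as a one-row DataFrame and let patsy do the plugging
listing = pd.DataFrame({"bathrooms": [2], "sqft_above": [430], "sqft_lot": [3088]})
estimate = fit.predict(listing).iloc[0]
print(round(estimate))
```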

In [12]:
#The difference between our model's estimate and the actual sale price is
diff_price = 621045.8436 - 615000
print(diff_price)
6045.843600000022

6.1) Confidence interval

A point estimate like the one we generated above for a house in Seattle gives us a rough idea of what the actual price will be. But point estimates are prone to error, and running multiple samples is not efficient either. A confidence interval is a range of values above and below a point estimate that captures the true population parameter at some predetermined confidence level.

We will calculate the 95% confidence interval, so that we can be reasonably confident that the true price of the listed house lies within that interval.

Calculating the confidence interval

With 95% confidence,

We use stats.norm.ppf(q = 0.975) to get the desired z-critical value, instead of q = 0.95 (95%), because the distribution has two tails. We also calculate the standard deviation of price.

In [13]:
#calculating the critical value. 
z_critical = stats.norm.ppf(q = 0.975)  

# Check the z-critical value
print("z-critical value:")              
print(z_critical)                        


price_stdev = df['price'].std()  # Get the sample standard deviation of price

margin_of_error = z_critical * (price_stdev/math.sqrt(len(df)))

confidence_interval = (price_Estimate - margin_of_error,
                       price_Estimate + margin_of_error)  

print(price_stdev)
print(confidence_interval)
z-critical value:
1.959963984540054
563834.7025471414
(604752.0962745857, 637339.5909254143)
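As a cross-check (not part of the original notebook), scipy's `stats.norm.interval` collapses the z-critical lookup and the margin arithmetic into a single call. The numbers below are the ones printed above; the row count of 4600 is an assumption standing in for `len(df)`:

```python
import math
from scipy import stats

point_estimate = 621045.8436     # model's estimate from section 6.0
price_stdev = 563834.7025471414  # standard deviation of price printed above
n = 4600                         # assumed row count; use len(df) in the notebook

# norm.interval(0.95, ...) returns the (lower, upper) two-tailed interval
low, high = stats.norm.interval(0.95, loc=point_estimate,
                                scale=price_stdev / math.sqrt(n))
print(low, high)
```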

7.0 EXPLORING THE TOP MOST EXPENSIVE CITY TO PURCHASE A HOUSE IN WA

Having predicted the price of a house in Seattle, we now look at the most expensive cities to live in within Washington state. We do this by calculating the average price of houses in each city, which can be achieved using groupby with a mean aggregate function. This sums the prices of all the houses in a city and divides by the number of houses in that city, giving the average house price for each city in our dataset.

7.1) Generating the desired table to be analysed and visualized

In [14]:
#getting the average price of houses in each city
sum_Price_Per_City = df.groupby(['city'])['price'].agg('mean').reset_index()

#sum_Price_Per_Street = sum_Price_Per_Street[['CITY','TOT_HOUSE_PRICES']]

sum_Price_Per_City.columns = ['CITY','AVG_HOUSE_PRICES_CITY']
#displaying only 20 cities here!
sum_Price_Per_City.head(20)
Out[14]:
CITY AVG_HOUSE_PRICES_CITY
0 Algona 2.072880e+05
1 Auburn 2.993404e+05
2 Beaux Arts Village 7.450000e+05
3 Bellevue 8.471807e+05
4 Black Diamond 3.396056e+05
5 Bothell 4.814419e+05
6 Burien 3.489472e+05
7 Carnation 5.087520e+05
8 Clyde Hill 1.321945e+06
9 Covington 2.962304e+05
10 Des Moines 3.049925e+05
11 Duvall 4.039941e+05
12 Enumclaw 3.076146e+05
13 Fall City 6.926818e+05
14 Federal Way 2.898877e+05
15 Inglewood-Finn Hill 4.250000e+05
16 Issaquah 5.961637e+05
17 Kenmore 4.474940e+05
18 Kent 4.394924e+05
19 Kirkland 6.515836e+05

7.2) Histogram to help us visualize the average price of houses in different cities in WA

In [15]:
#graphing the histogram
plt.figure(figsize=(15,10))
plt.title("Average price of houses in a given city in Washington State", fontsize=16)
sns.barplot(data=sum_Price_Per_City, x='AVG_HOUSE_PRICES_CITY', y='CITY')
plt.show()

7.3) Analysing the histogram.

From the histogram we can immediately observe the disparities in house prices across Washington state. Looking at the histogram, Seattle by far has the highest average house price. This can be attributed to the booming technology sector in the area: big tech giants, the likes of Amazon and Microsoft, are headquartered in the area, which has led to exorbitant house prices.

According to a Seattle Times online editorial authored by Rosenberg, house prices in Seattle are expected to go even higher with a looming Amazon second HQ. The only city that comes close to Seattle is Bellevue. Its high average house price can also be attributed to Amazon, because Bellevue was Amazon's first home before the company moved to Seattle. Amazon has maintained a presence in Bellevue, and according to a Business Insider article authored by Chan, "Amazon is inching its way back to its original home city". The company is planning to build its tallest tower in that city and move all of its worldwide operations there.

This means that now is the right time to buy property in Bellevue, before things go crazy the way they did in Seattle. For a buyer looking to purchase property in Washington state, Bellevue will be a good buy even though prices are already a little high, because the future looks bright. And if you live in that area and are planning to sell, my advice is to hold onto your house and wait for the windfall that is Amazon's second coming.

8.0 MOST EXPENSIVE STREETS IN THE MOST EXPENSIVE CITY IN WA

Given that Seattle is by far more expensive than all the other cities in Washington state, let's explore the expensive and cheap neighbourhoods within that city. That is, if someone really wanted to buy a house in Seattle, which neighbourhoods or streets could they afford to look at?

To plot the histogram that will let us visualize and understand the pricing of houses on different streets in Seattle, we extract all the rows of the dataset where the 'city' column equals Seattle. We then carry out much the same calculation we used to find the average house price grouped by city, but this time we group by street.

8.1) Organising the data

Given that there are a lot of streets in Seattle, putting every street on the histogram would not let us fully analyze the average prices on each street and make informed decisions. So we plot the top 75 most expensive streets/neighbourhoods in Seattle: we sort the values in descending order, take the top 75 streets (those with the highest average price), and plot these on the histogram for analysis.

In [16]:
#Extracting the rows where city == Seattle from the dataset
seatle_Streets_Table = df[(df['city'] == 'Seattle')]


sum_Price_Per_Street = seatle_Streets_Table.groupby(['street'])['price'].agg('mean').reset_index()

sum_Price_Per_Street.columns = ['Street','AVG_HOUSE_PRICES']

#sorting the values in descending order so that we can get the top 75.
sorted_Price_Per_Street = sum_Price_Per_Street.copy().sort_values('AVG_HOUSE_PRICES', ascending=False)

#table for the top 75 most expensive streets in Seattle.
top_75 = sorted_Price_Per_Street.head(75)
top_75.head()
Out[16]:
Street AVG_HOUSE_PRICES
1027 5426 40th Ave W 1.289900e+07
194 1230 Warren Ave N 3.200000e+06
234 1291 NW Elford Dr 2.888000e+06
580 2826 21st Ave W 2.560498e+06
903 4437 55th Ave NE 2.555000e+06
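As an aside, pandas' `nlargest` combines the sort-descending-then-head steps above into one call. A toy sketch with made-up streets and prices:

```python
import pandas as pd

# Toy table standing in for sum_Price_Per_Street (streets are made up)
toy = pd.DataFrame({
    "Street": ["A St", "B Ave", "C Rd", "D Ln"],
    "AVG_HOUSE_PRICES": [2.5e6, 1.2e7, 9.0e5, 3.1e6],
})

# nlargest(n, col) == sort_values(col, ascending=False).head(n)
top_2 = toy.nlargest(2, "AVG_HOUSE_PRICES")
print(top_2["Street"].tolist())  # -> ['B Ave', 'D Ln']
```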

8.2) Histogram to help us visualize the average prices of houses by street in Seattle.

In [17]:
plt.figure(figsize=(15,15))
plt.title("Average price of houses on a given street in Seattle ", fontsize=16)
sns.barplot(data=top_75, x='AVG_HOUSE_PRICES', y='Street')
plt.show()

From the histogram, we can observe that the street with the highest average house price is 5426 40th Ave W. The average price there is nearly 13 million dollars, astronomically higher than the street with the second-highest average. This indicates the price disparity between various streets of Seattle, where one neighbourhood has astronomically higher prices than another.

9.0 CONCLUSION

Our model predicted the price of a house in Seattle within a few thousand dollars of the actual sale price, as shown by the recent listing above. Running the regression model, we realised that most of the variables were not significant in explaining the change in the price of a house. But we also noted that the street on which a house is located is highly significant in explaining the change in price. This can be attributed to the fact that when we removed the street column, R-squared was too low to even do any analysis on the model. Further examination of the 'street' variable through the histogram gave us a good visual of its significance in our model.